library(magrittr)

The goal here is to explain and understand the cross-validation-plus procedure from Barber et al. (2021). We ultimately want to consider applying it to obtain prediction intervals for PGS in the UKB study.

I initially thought that the notation was a little tricky, but it’s really not that bad.

Overview of CV-plus

Equation 3.1 in Barber et al. (2021), which I reproduce below, defines the prediction interval for a single new point \(Z_{n + 1} = (X_{n +1}, Y_{n+1})\).

\[\begin{equation} \hat C_{n, K, \alpha}^{CV+}(X_{n + 1}) = \left[ \hat q_{n, \alpha}^-\lbrace \hat \mu_{- S_{k(i)}}(X_{n + 1}) - R_i^{CV}\rbrace, \hat q_{n, \alpha}^+\lbrace \hat \mu_{- S_{k(i)}}(X_{n + 1}) + R_i^{CV}\rbrace \right] \end{equation}\]

where

\[\begin{equation} R_i^{CV} = |Y_i - \hat\mu_{- S_{k(i)}}(X_i)| \end{equation}\]

Note that we need the absolute values of the residuals.

Thus, we must apply the quantile function to each of two collections of numbers, where the size of each collection is the number of training samples:

  1. \(\hat \mu_{- S_{k(i)}}(X_{n + 1}) - R_i^{CV}\)
  2. \(\hat \mu_{- S_{k(i)}}(X_{n + 1}) + R_i^{CV}\)

For example, for \(K = 5\), we have 5 distinct functions, \(\hat \mu_{-1}\), …, \(\hat \mu_{-5}\), where the subscript designates which subset of subjects is removed when estimating function parameters. Note that Barber et al. (2021) assumes that the 5 folds have the exact same number of subjects in them.

ryantibs/conformal repository

We probably don’t want to use Tibshirani’s repo, as it seems to lack a function for CV+.

Custom R code

It should be straightforward to use DBSLMM outputs and fam files to write custom code for CV+.

Inputs needed:

Note that we also need to use Sheng’s validation set, which means that we need to know which subjects, from the ~340,000 total, are in the validation set. I believe that Sheng has text files that indicate which 1000 subjects went into the validation set, but I need to verify this.

In previous analyses, I’ve calculated predicted values for only the held-out subjects for each cross validation. Here, for CV+, I essentially need the predicted values for all ~337,000 subjects for each cross validation fold.

Thus, I’ll need to create a new bash file that adapts my (ie, Sheng’s) bash code to ensure that all subjects get a predicted value calculated for each of the five folds. My current bash script calculates predicted values for only the held-out subjects for each fold. I can change this by altering (or removing?) the keep argument when calling plink’s score function.

So, it turns out that what I actually need is to use Sheng’s “validation” set subjects as the \(X_{n+1}\) and \(Y_{n + 1}\)

I also need to evaluate \(\mu_{- S_{k(i)}}(X_{n + 1})\) for each of the five (or \(K\) for \(K\)-fold CV+) “mu-hat” functions. Thus, I’ll need to treat what Sheng is calling the validation set as a “true test set”. I’ll need to read their genotypes and their true trait values. The true trait values are in a fam file, but I’m not sure which. I need to check on this. It’s possible that I need to process the traits, but I’m guessing that Sheng has already done so.

First step is to find the file (in Sheng’s directory) that tells me which subjects, from the nearly 340,000, go into the validation sets for each trait.

Once I know which subjects go into the validation set, I can do the allele scoring in plink for their genotype data. Obviously, I need to know where to find their genotypes data. I seem to recall that Sheng has a directory with the validation genotype data. I’ll look for it. Alternatively, he might just have it in the “predictionProject” directory, under “plink files”. I could use them if I can get the indices for the “validation set” subjects.

It looks like the directories that I need are in comparisonProject:

~/research/comparisonProject/03_subsample/continuous/pheno1/val/ukb/ and similarly named directories. The processed phenotypes, I think, for the validation set are here, as are the plink files for the validation set. The above files are written in Sheng’s file, “comparisonProject/code/01_process_dat/05_get_subsample.sh”

comparisonProject/03_subsample/continuous/pheno1/val/ukb/01_idx.txt (and similarly named) is the file that indicates which subjects are in the validation set for each trait. It contains two identical columns for identifying subjects.

How to use the results files from DBSLMM

alpha = 0.2
out <- list()
res_file <- "res.rds"
if (!file.exists(res_file)){
  res <- lapply(1:25, function(x) {
    knitr::knit_child(
      'cv-plus-child.Rmd', 
      envir = environment(), 
      quiet = TRUE
    )
    get("out")
  })
  saveRDS(res, file = res_file)
} else {
  res <- readRDS(res_file)
}
coverages <- lapply(X = res, 
                    FUN = function(mytib){
  mytib %>%
    dplyr::summarise(coverage = mean(in_interval))
       }
)
coverage_tib <- do.call("rbind", coverages) %>%
  tibble::as_tibble() %>%
  dplyr::mutate(trait_num = 1:length(res))
coverage_tib %>%
  gt::gt()
coverage trait_num
0.919 1
0.905 2
0.917 3
0.917 4
0.909 5
0.920 6
0.914 7
0.917 8
0.906 9
0.919 10
0.898 11
0.899 12
0.910 13
0.906 14
0.893 15
0.903 16
0.897 17
0.902 18
0.898 19
0.910 20
0.907 21
0.897 22
0.902 23
0.908 24
0.896 25

Boxplots of interval width

for (i in 1:length(res)){
  res[[i]] <- res[[i]] %>%
    dplyr::mutate(trait_num = i)
}
dplyr::bind_rows(res) %>%
  ggplot2::ggplot() + 
  ggplot2::geom_boxplot(ggplot2::aes(x = trait_num, y = interval_width, group = trait_num))

Incorporate estimated heritability of traits

h_tib <- readr::read_csv("~/research/ukb-intervals/shell_scripts/h2.csv", col_names = FALSE) %>% dplyr::select(-26) # drop 26, which is due to presence of trailing commas
## Rows: 5 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (25): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ...
## lgl  (1): X26
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pp <- h_tib %>% 
  dplyr::summarise(dplyr::across(X1:X25, mean)) %>% 
  tidyr::pivot_longer(X1:X25) %>%
  dplyr::rename(mean_herit = value) %>%
  dplyr::mutate(trait_num = 1:25) %>%
  dplyr::select(-name) %>%
  dplyr::select(trait_num, mean_herit) %>%
  dplyr::left_join(dplyr::bind_rows(res), by = "trait_num") 
pp2 <- pp %>%
  ggplot2::ggplot() + ggplot2::geom_boxplot(ggplot2::aes(x = mean_herit, y = interval_width, group = trait_num, colour = as.factor(trait_num), label = trait_num), show.legend = FALSE)
## Warning: Ignoring unknown aesthetics: label
  pp2 %>% plotly::ggplotly(tooltip = "label")

Plot coverage vs (mean) heritability

pp2 <- pp %>%
  dplyr::group_by(trait_num) %>%
  dplyr::summarise(mean_herit = mean(mean_herit), coverage = mean(in_interval)) %>%
  ggplot2::ggplot() + ggplot2::geom_point(ggplot2::aes(x = mean_herit, y = coverage, colour = as.factor(trait_num))) 
pp2 %>%
  plotly::ggplotly()

Check correlation between (mean) heritability and coverage

summ <- pp %>%
  dplyr::group_by(trait_num) %>%
  dplyr::summarise(mean_herit = mean(mean_herit), coverage = mean(in_interval))
cor(summ$mean_herit, summ$coverage)
## [1] 0.4413399

References

Barber, Rina Foygel, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. 2021. “Predictive Inference with the Jackknife+.” The Annals of Statistics 49 (1): 486–507.